SYDE 676 Project Report – Fall 2002 Web Document Clustering Using Phrase-based Document Similarity
Abstract
Measuring the similarity between documents is an essential operation in text mining, especially in document clustering. Traditionally, similarity has been computed by extracting individual words from the documents, weighting those features heuristically, and then applying standard data mining methods to the weighted features. In this project, an information-theoretic similarity measure is derived from the phrases that documents share, rather than from individual words. Two pairwise document similarity measures are proposed: one corpus-dependent and one corpus-independent. The corpus-independent measure allows documents to be processed incrementally, and it is the only one evaluated in this report. The measure is used to cluster web documents, where it achieved higher accuracy than traditional similarity measures. Clustering quality is evaluated using the F-measure and Entropy.
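The abstract names the F-measure and Entropy as the evaluation measures but does not give their formulas. The following is a minimal sketch, assuming the definitions commonly used in the document clustering literature: the overall F-measure is the class-size-weighted best F score that each true class achieves over all clusters, and the overall Entropy is the cluster-size-weighted entropy of the class distribution inside each cluster. The function names `clustering_f_measure` and `clustering_entropy` and the toy labels are hypothetical and are not taken from the report.

```python
import math
from collections import Counter


def clustering_f_measure(labels, clusters):
    """Class-size-weighted best F score per true class (assumed definition)."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))  # documents of class i in cluster j
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total


def clustering_entropy(labels, clusters):
    """Cluster-size-weighted entropy of class labels (assumed definition)."""
    n = len(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    total = 0.0
    for clu, n_j in cluster_sizes.items():
        e_j = 0.0
        for (cls, c), n_ij in joint.items():
            if c == clu:
                p = n_ij / n_j
                e_j -= p * math.log2(p)
        total += (n_j / n) * e_j
    return total


# Toy data (hypothetical): four documents, two true classes.
true_labels = ["a", "a", "b", "b"]
perfect = [0, 0, 1, 1]   # every cluster is pure
mixed = [0, 1, 0, 1]     # every cluster is half "a", half "b"

f_perfect = clustering_f_measure(true_labels, perfect)
e_perfect = clustering_entropy(true_labels, perfect)
f_mixed = clustering_f_measure(true_labels, mixed)
e_mixed = clustering_entropy(true_labels, mixed)
```

Under these definitions a higher F-measure and a lower Entropy both indicate a better clustering, so the two measures are usually reported together.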
Similar resources
A Novel Weighted Phrase-Based Similarity for Web Documents Clustering
Phrase has been considered as a more informative feature term for improving the effectiveness of document clustering. In this paper, a weighted phrase-based document similarity is proposed to compute the pairwise similarities of documents based on the Weighted Suffix Tree Document (WSTD) model. The weighted phrase-based document similarity is applied to the Group-average Hierarchical Agglomerat...
Web Document Clustering based on Document Structure
Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, document structure should be reflected in the underlying data model. This paper presents a framework for web document clustering based on two important concepts. The first one is the web document structure, which is currently ...
Phrase-based Document Similarity Based on an Index Graph Model
Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in the document as well as single terms. We present a novel data model, the Document Index Graph, which indexes web documents based on phrases, rather than sing...
Phrase based Clustering Scheme of Suffix Tree Document Clustering Model
Document clustering is one of the difficult and recent research fields in the search engine research. Most of the existing documents clustering techniques use a group of keywords from each document to cluster the documents. Document clustering arises from information retrieval domains, and “It finds grouping for a set of documents belonging to the same cluster are similar and documents belongs ...
Transition Potential Modeling of Land-Cover based on Similarity Weighted Instance-based Learning Procedure and Its Implication in the REDD Project Design Document
Reducing Emissions from Deforestation and Forest Degradation (REDD) is a climate change mitigation strategy employed to reduce the intensity of deforestation and GHG emissions. In recent decades, drastic land use changes in Mazandaran province caused a substantial reduction in the amount of Hyrcanian forests. The present research based on objectives of REDD projects paid to identify of fore...
Publication date: 2003